Task Run the following code to load packages.
library(rvest)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5 ✓ purrr 0.3.4
## ✓ tibble 3.1.4 ✓ dplyr 1.0.7
## ✓ tidyr 1.1.3 ✓ stringr 1.4.0
## ✓ readr 2.0.1 ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x readr::guess_encoding() masks rvest::guess_encoding()
## x dplyr::lag() masks stats::lag()
library(janitor)
##
## Attaching package: 'janitor'
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
For this example, we’re going to work on loading a simple table of data from the Bureau of Labor Statistics. This is a table of industry sectors (each with a two-digit NAICS code) that we could make use of in our analysis of PPP loan data.
Recall that our PPP loan data has six-digit NAICS codes for each industry, which allows us to identify the industry for each loan. For example 212221 is the code for “Gold Mining Industry”.
A six-digit NAICS code is the most specific. As we remove numbers from the right to create five-digit, four-digit, three-digit and two-digit codes, the industries they represent get broader. Here’s an example:
It might be useful to have a lookup table of those top-level, two-digit NAICS codes (also called sector codes) for our analysis, to help us answer questions about what specific top-level industries got loans. L
Let’s suppose we can’t find a table like that for download, but we do see a version on the BLS website at this URL: https://www.bls.gov/ces/naics/.
Task Visit the web page linked above. Briefly describe what you see there. Answer The webpage includes a brief history of industry classification as well as a table for NAICS sectors. It displays the sector number and a description.
Task Run the following code to display an image showing what you should see on the web page.
We could get this table into R with the following manual steps: highlighting the text, copying it into Excel, saving it as a csv, and reading it into R.
Or, we could write a few lines of webscraping code to have R do that for us!
In this simple example, it’s probably faster to do it manually than have R do it for us. And this table is unlikely to change much in the future.
Why would we ever write code to grab a single table? There’s several reasons:
So, to scrape, the first thing we need to do is start with the URL. Let’s store it as an object called naics_url.
Task Run the following code to store the URL.
naics_url <- "https://www.bls.gov/ces/naics/"
When we go to the web page, we can see a nicely-designed page that contains our information.
But what we really care about, for our purposes, is the html code that creates that page.
In our web browser, if we right-click anywhere on the page and select “view source” from the popup menu, we can see the source code. Or you can just copy this into Google Chrome: view-source:https://www.bls.gov/ces/naics/.
Task Follow the directions in the previous paragraph. Briefly describe what you see there. Answer I see the source code. I can see the different ‘class’ tags and what the coders decided to label them as. I can also see comment code, which, similar to how it works in R, act as a guideline when writing code.
Here’s a picture of what some of the source code looks like.
Task Run the following code to display an image showing what you should see on the web page. We’ll use those HTML tags – things like
<div> and <a> and <table> – to grab the info we need.
Okay, step 1.
Let’s write a bit of code to tell R to go to the URL for the page and ingest all of that HTML code. In the code below, we’re starting with our URL and using the read_html() function from rvest to ingest all of the page html, storing it as an object called naics_industry.
Task Run the following code to read in the html. Briefly describe the output that appears below the codeblock. Answer The output displays the html code that we read in. It shows the head and body tags as well as what language the code’s in.
# read in the html
naics_industry <- naics_url %>%
read_html()
# display the html below
naics_industry
## {html_document}
## <html lang="en-us">
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
## [2] <body class="layout-fixed">\n\t\t\t\t<section aria-label="content"><!-- P ...
If you’re running this code in R Studio, in our environment window at right, you’ll see naics_industry as a “list of 2”.
This is not a dataframe, it’s a different type of data structure a “nested list.”
If we click on the name “naics_industry” in our environment window, we can see that it’s pulled in the html and shown us the general page structure.
Task Follow the directions in the previous paragraph to click on naics_industry in the environment window. Briefly describe what you see there. Answer I see the name, type, and category of the name “naics_industry”. Types include list and character, and values include “list of length 2” and ‘en-us’.
Nested within the <html> tag is the <head> and <body>, the two fundamental sections of most web pages. We’re going to pull information out of the <body> tag in a bit.
Task Run the following code to display an image showing what you should see when clicking on naics in the environment window.
Now, our task is to just pull out the section of the html that contains the information we need.
But which part do we need from that mess of html code? To figure that out, we can go back to the page in a web browser like chrome, and use built in developer tools to “inspect” the html code underlying the page.
On the page, find the data we want to grab – “Table 2. NAICS Sectors” - and right click on the word “Sector” in the column header of the table. That will bring up a dropdown menu. Select “Inspect”, which will pop up a window called the “element inspector” that shows us where different elements on the page are located, what html tags created those elements, and other info.
Task Follow the directions in the previous paragraph. Briefly describe what you see there. Answer I see the tag for the table and the other elements that go inside of it, such as classes and various headings.
Task Run the following code to display an image showing what you should see on the web page.
The entire table that we want of naics sectors is actually contained inside an html <table>. It has a header row <thead> that contains the column names and a <tbody> that contains one row <tr> per industry sector code.
Because it’s inside of a table, and not some other kind of element (like a <div>), rvest has a special function for easily extracting and converting html tables, called html_table(). This function extracts all six html tables on the page, only one of which we actually want.
Task Run the following code to process tables. Briefly describe the output that appears below the codeblock. Answer The output produced the tables from the website, including conversion implementation dates, sectors, and divisons.
# read in the html and extract all the tables
naics_industry <- naics_url %>%
read_html() %>%
html_table()
# display the tables below
naics_industry
## [[1]]
## # A tibble: 4 × 3
## Conversion `Reference Month Implemented` `Date Released`
## <chr> <chr> <chr>
## 1 SIC to NAICS 2002 May 2003 June 6, 2003
## 2 NAICS 2002 to NAICS 2007 January 2008 February 1, 2008
## 3 NAICS 2007 to NAICS 2012 January 2012 February 3, 2012
## 4 NAICS 2012 to NAICS 2017 January 2018 February 2, 2018
##
## [[2]]
## # A tibble: 21 × 2
## Sector Description
## <chr> <chr>
## 1 11 Agriculture, Forestry, Fishing and Hunting
## 2 21 Mining, Quarrying, and Oil and Gas Extraction
## 3 22 Utilities
## 4 23 Construction
## 5 31-33 Manufacturing
## 6 42 Wholesale Trade
## 7 44-45 Retail Trade
## 8 48-49 Transportation and Warehousing
## 9 51 Information
## 10 52 Finance and Insurance
## # … with 11 more rows
##
## [[3]]
## # A tibble: 11 × 2
## Division Description
## <chr> <chr>
## 1 A Agriculture, Forestry, And Fis…
## 2 B Mining
## 3 C Construction
## 4 D Manufacturing
## 5 E Transportation, Communications…
## 6 F Wholesale Trade
## 7 G Retail Trade
## 8 H Finance, Insurance, And Real E…
## 9 I Services
## 10 J Public Administration
## 11 Source: www.osha.gov/pls/imis/sic_manual.html Source: www.osha.gov/pls/imis/…
##
## [[4]]
## # A tibble: 6 × 6
## SIC SIC SIC NAICS NAICS NAICS
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 "Level" "Code1" "Example2" Level Code1 Example2
## 2 "Division" "Alpha" "D" Sector XX 31
## 3 "Major Group" "XX" "20" Subsector XXX 311
## 4 "Industry Group" "XXX" "203" Industry Group XXXX 3114
## 5 "Industry" "XXXX" "2037" NAICS Industry XXXXX 31141
## 6 "" "" "" U.S. Industry XXXXXX 311411
##
## [[5]]
## # A tibble: 4 × 4
## Positions `Example Value` `Field Name` `All Possible Values`
## <chr> <chr> <chr> <chr>
## 1 1-2 CE Survey Abbreviation CE
## 2 3 U Seasonal Adjustment Code S,U
## 3 4-11 31335311 Industry Code 00000000 through 90932999
## 4 12-13 01 Data Type Code 01 through 99
##
## [[6]]
## # A tibble: 4 × 4
## Positions `Example Value` `Field Name` `All Possible Values`
## <chr> <chr> <chr> <chr>
## 1 1-2 EE Survey Abbreviation EE
## 2 3 S Seasonal Adjustment Code S,U
## 3 4-9 000000 Industry Code 000000 through 959000
## 4 10-11 01 Data Type Code 01 through 83
In the environment window at right, look at naics_industry. Note that it’s now a “list of 6”.
Click on it to open it up.
Task Follow the directions in the previous paragraph. Briefly describe what you see there. Answer I see a list of 6 that includes name, type, and value. The value and list columns include how long each table is.
It should look like this.
Task Run the following code to display an image. This gets a little complicated, but what you’re seeing here is a nested list that contains six different data frames – also called tibbles – one for each table that exists on the web page we scraped.
They’re numbered 1 to 6. The first 1 has 4 rows and 3 columns, the second has 21 rows and 2 columns, and so on.
To examine what’s in each dataframe, mouse over the right edge (next to the word columns) on each row, and click the little scroll icon. The icon will be hidden until you mouse over it.
Click on the scroll icon for the first dataframe examine it.
Task Follow the directions in the previous paragraphs. Briefly describe what you see there. Answer Clicking on the scroll icon shows you the full table in a new window. The first dataframe is the conversion table.
Task Run the following code to display an image showing what you should see when you follow the steps in task 13.
That’s not the one we want!
Let’s try clicking on the scroll icon for row 2.
Task Follow the directions in the previous paragraph. Briefly describe what you see there. Answer The scroll opened the second dataframe which is the sector and its description.
Task Task Run the following code to display an image showing what you should see when you follow the steps in task 15.
That’s more like it! So, all we need to do now is to store that single dataframe as an object, and get rid of the rest. We can do that with this code, which says "keep only the second dataframe from our nested list. If we wanted to keep the third one, we’d change the number 2 to number 3.
Task Run the following code to keep only the table we want. Briefly describe the output that appears below the codeblock. Answer The output produces only the sector table.
# Read in all html from table, store all tables on page as nested list of dataframes.
naics_industry <- naics_url %>%
read_html() %>%
html_table()
# Just keep the second dataframe in our list
naics_industry <- naics_industry[[2]]
# show the dataframe
naics_industry
## # A tibble: 21 × 2
## Sector Description
## <chr> <chr>
## 1 11 Agriculture, Forestry, Fishing and Hunting
## 2 21 Mining, Quarrying, and Oil and Gas Extraction
## 3 22 Utilities
## 4 23 Construction
## 5 31-33 Manufacturing
## 6 42 Wholesale Trade
## 7 44-45 Retail Trade
## 8 48-49 Transportation and Warehousing
## 9 51 Information
## 10 52 Finance and Insurance
## # … with 11 more rows
We now have a proper dataframe.
From here, we can do a little light cleaning. Let’s use clean_names() to standardize the column names. Then let’s use slice() to remove the last row – row number 21 – which contains source information that will complicate our use of this table later.
Task Run the following code to do light cleaning. Briefly describe the output that appears below the codeblock. Answer It’s a more clean and concise sector table. You don’t have to scroll left and right to see the different column headings.
# Read in all html from table, store all tables on page as nested list of dataframes.
naics_industry <- naics_url %>%
read_html() %>%
html_table()
# Just keep the second dataframe in our list, standardize column headers, remove last row
naics_industry <- naics_industry[[2]] %>%
clean_names() %>%
slice(-21)
# show the dataframe
naics_industry
## # A tibble: 20 × 2
## sector description
## <chr> <chr>
## 1 11 Agriculture, Forestry, Fishing and Hunting
## 2 21 Mining, Quarrying, and Oil and Gas Extraction
## 3 22 Utilities
## 4 23 Construction
## 5 31-33 Manufacturing
## 6 42 Wholesale Trade
## 7 44-45 Retail Trade
## 8 48-49 Transportation and Warehousing
## 9 51 Information
## 10 52 Finance and Insurance
## 11 53 Real Estate and Rental and Leasing
## 12 54 Professional, Scientific, and Technical Services
## 13 55 Management of Companies and Enterprises
## 14 56 Administrative and Support and Waste Management and Remediation Servi…
## 15 61 Educational Services
## 16 62 Health Care and Social Assistance
## 17 71 Arts, Entertainment, and Recreation
## 18 72 Accommodation and Food Services
## 19 81 Other Services (except Public Administration)
## 20 92 Public Administration
And there we go. We now have a nice tidy dataframe of NAICS sector codes.
In the next chapter, we’ll look at a more complicated example.
In the last chapter, we demonstrated a fairly straightforward example of web scraping to grab a list of NAICS industry sector codes from the BLS website.
We’re going to graduate to a more challenging example, one that will help us gather information about the number of employees in each industry sector.
What makes this more challenging? Well, the information we need is all contained on multiple pages, one page per sector. We need to write code to visit each page, and then merge them into a single data frame. This is challenging stuff, so don’t feel dissuaded if it all doesn’t click the first time through. Like many things, web scraping is something that gets easier with lots of practice.
First we start with libraries, as we always do.
Task Run the following code to load packages.
library(rvest)
library(tidyverse)
library(janitor)
Now, let’s run the code we wrote in the last chapter, to get a tidy list of NAICS sector codes and names from https://www.bls.gov/ces/naics/.
Task Run the following code to load table from last chapter. Briefly describe the output that appears below the codeblock. Answer The output produces the clean sector table from above.
# Define url of page we want to scrape
naics_url <- "https://www.bls.gov/ces/naics/"
# Read in all html from table, store all tables on page as nested list of dataframes.
naics_industry <- naics_url %>%
read_html() %>%
html_table()
# Just keep the second dataframe in our list, standardize column headers, remove last row
naics_industry <- naics_industry[[2]] %>%
clean_names() %>%
slice(-21)
# show the dataframe
naics_industry
## # A tibble: 20 × 2
## sector description
## <chr> <chr>
## 1 11 Agriculture, Forestry, Fishing and Hunting
## 2 21 Mining, Quarrying, and Oil and Gas Extraction
## 3 22 Utilities
## 4 23 Construction
## 5 31-33 Manufacturing
## 6 42 Wholesale Trade
## 7 44-45 Retail Trade
## 8 48-49 Transportation and Warehousing
## 9 51 Information
## 10 52 Finance and Insurance
## 11 53 Real Estate and Rental and Leasing
## 12 54 Professional, Scientific, and Technical Services
## 13 55 Management of Companies and Enterprises
## 14 56 Administrative and Support and Waste Management and Remediation Servi…
## 15 61 Educational Services
## 16 62 Health Care and Social Assistance
## 17 71 Arts, Entertainment, and Recreation
## 18 72 Accommodation and Food Services
## 19 81 Other Services (except Public Administration)
## 20 92 Public Administration
We’ll use this table to help us get to our end goal: a single dataframe with the number of employees in each industry sector.
It will look like this when we’re done.
Task Run the following code to display an image showing what you should see on the web page.
Unfortunately, that information doesn’t exist in a single tidy table on a single page we can scrape all at once. We’re going to have to scrape it from lots of different pages, and build it ourselves.
Let’s next take a look at the web page that has detailed employment information for one of our sectors, 22, Mining, Quarrying, and Oil and Gas Extraction.
We can find it here: https://www.bls.gov/iag/tgs/iag22.htm.
Task Visit the web page linked above. Briefly describe what you see there. Answer The webpage is about the utilities sector. It has tables that show workforce statistics including employment numbers and unemployment rates.
A few scrolls down the page, there’s a table that has employee statistics.
The table is called “Employment and Unemployment”. There’s a row in the tabled for “Employment, all employees (seasonally adjusted)”. And in that row, there’s a value for the number of employees – in thousands – in June 2021.
The table shows that for the mining sector, it was 538.6 – or 538,600 – in June 2021. That’s the value we want to ingest in R.
Task Run the following code to display an image showing what you should see on the web page.
We don’t just want it for mining. We want it for all sectors!
But we’ll start by writing code just to get it from this one sector page, then modify that code to get it from every sector’s page
First, let’s define the URL of the page we want to get the information from.
Task Run the following code to store the URL.
# Define url of the page we want to get
url <- "https://www.bls.gov/iag/tgs/iag22.htm"
Next, let’s read in the html of that page, and store it as an object called employment_info.
Task Run the following code to read in the html. Briefly describe the output that appears below the codeblock. Answer The employment html is displayed below the codeblock. We see the head and body tags.
# Define url of the page we want to get
url <- "https://www.bls.gov/iag/tgs/iag22.htm"
# Get employment html
employment_info <- url %>%
read_html()
# Display it so we can see what it looks like
employment_info
## {html_document}
## <html lang="en">
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
## [2] <body>\r\n<!-- ****************************************** Begin HEADER ** ...
Now, let’s set to picking out the information we need from the raw html.
We can use the web inspector in our web browser (Chrome) to figure out where the table is located.
Go to the web page and right click on the word “Data Series” in the table, then pick “inspect” to pull up the menu.
Notice two things. First, all of this information is contained in a proper html <table>. And that table has an id property of “iag22emp1”. Designers use these IDs to help style the page, to target certain elements with CSS. We can use it to scrape.
Task Run the following code to display an image showing what you should see on the web page. Recall that in the last chapter, when we used the html_table() function, it pulled in every single table on the page, six in total.
Here, we can use that id property to pick out just the table we want, and leave all the others behind.
We do that with a new function from rvest called html_element(), employing a bit of information about that element stored in what’s called the xpath. Xpath is a query language that helps us write programs that target specific parts of web pages.
The syntax is a little unwieldy, I know.
But essentially what the html_element function says is “find the html element that has an id of iag22emp1, using the xpath method, and get rid of all other elements”.
Task Run the following code to get html_element with info we need. Briefly describe the output that appears below the codeblock. Answer We now see the caption, thead, tbody, and tfoot tags in the output as opposed to the regular head and body tags. We isolated the table.
# Define url of the page we want to get
url <- "https://www.bls.gov/iag/tgs/iag22.htm"
# Get employment html page and select only the table with employment information
employment_info <- url %>%
read_html() %>%
html_element(xpath = '//*[@id="iag22emp1"]')
# Display it so we can see what it looks like
employment_info
## {html_node}
## <table class="regular" id="iag22emp1">
## [1] <caption><span class="tableTitle"></span></caption>
## [2] <thead><tr>\n<th class="stubhead" id="iag22emp1.h.1.1">Data series</th>\r ...
## [3] <tbody>\n<tr>\n<th headers="iag22emp1.h.1.1" id="iag22emp1.r.1"><p class= ...
## [4] <tfoot><tr class="footnotes">\n<td class="footnotes" colspan="6">\r\n\t<p ...
We’ve now isolated the table on the page that contains the information we need, and gotten rid of everything else.
From here, we can use the html_tables() function to transform it from messy html code to a proper dataframe.
Task Run the following code to convert to table. Briefly describe the output that appears below the codeblock. Answer The data series table appears.
# Define url of the page we want to get
url <- "https://www.bls.gov/iag/tgs/iag22.htm"
# Get employment html page and select only the table with employment information, then transform it from html to a table.
employment_info <- url %>%
read_html() %>%
html_element(xpath = '//*[@id="iag22emp1"]') %>%
html_table()
# Display it so we can see what it looks like
employment_info
## # A tibble: 6 × 6
## `Data series` Backdata Jun.2021 Jul.2021 Aug.2021 Sep.2021
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Employment (in thousa… "" "" "" "" ""
## 2 Employment, all emplo… "" "538.6" "538.9" "(p)538.1" "(p)537.6"
## 3 Employment, productio… "" "428.1" "429.1" "(p)427.4" "(p)427.1"
## 4 Unemployment "" "" "" "" ""
## 5 Unemployment rate "" "5.1%" "3.4%" "2.7%" "3.1%"
## 6 Footnotes(p) Prelimin… "Footnotes… "Footnote… "Footnote… "Footnote… "Footnote…
Now we have a proper dataframe of 6 rows and 6 columns.
It has much more information than we need, so let’s clean it up to isolate only the “Employment, all employees (seasonally adjusted)” value for June 2021.
Use clean_names() to standardize the column names, use slice() to keep only the second row, and use select() to keep two columns data_series and jun_2021.
Task Run the following code to keep row 2 and light cleaning. Briefly describe the output that appears below the codeblock. Answer We have a cleaner table with just the row we want (the value for June 2021). It’s much easier to read.
# Define url of the page we want to get
url <- "https://www.bls.gov/iag/tgs/iag21.htm"
# Get employment html page and select only the table with employment information, then transform it from html to a table.
employment_info <- url %>%
read_html() %>%
html_element(xpath = '//*[@id="iag21emp1"]') %>%
html_table()
# Keep only second row with seasonally adjusted, bind back to each_row_df
employment_info <- employment_info %>%
clean_names() %>%
slice(2) %>%
select(data_series, jun_2021)
# Display it so we can see what it looks like
employment_info
## # A tibble: 1 × 2
## data_series jun_2021
## <chr> <chr>
## 1 Employment, all employees (seasonally adjusted) 586.7
Okay, so we’ve successfully obtained the employment numbers for one of our sectors. That’s great.
But remember our original charge: to get a table with employment numbers for ALL sectors, not just one.
This is a little tricky, because, remember, the information for each sector is on a different page!
The info for mining is on this page: https://www.bls.gov/iag/tgs/iag21.htm.
Task Visit the web page linked above. Briefly describe what you see there. Answer The webpage has information about the mining sector including a table on workforce statistics, showing employment/unemployment.
The info for construction is on this page: https://www.bls.gov/iag/tgs/iag23.htm.
Task Visit the web page linked above. Briefly describe what you see there. Answer There is a table on workforce statistics, showing employment/unemployment/job openings.
We have 20 sectors to get through.
We could get the info we need by copying the codeblock we just wrote 20 times, and change the url at the top each time.
But that’s not a great approach.
What if we needed to change the code? We’d need to change it 20 times! In programming, there’s a principle called “DRY” which stands for “Don’t Repeat Yourself”. If you find yourself copying the same code over and over again, with minor changes, it’s better to find a way to avoid that.
Fortunately, there’s a programming paradigm called “iteration” that is helpful here, using a method called a “for loop”.
Every programming language has its own version of a “for loop”, and R is no different.
A “for loop” says: “let’s take a list of things, and do the same thing to each item on that list.”
Let’s look at a very simple example to help illustrate the values of for loops.
We’re going to write code to print out 10 industry sectors.
First, let’s do it the repetitive way. We’re writing the same print function over and over, just changing the sector name each time.
Task Run the following code to print sectors. Briefly describe the output that appears below the codeblock. Answer It creates a list of the new sector names.
print("Agriculture, Forestry, Fishing and Hunting")
## [1] "Agriculture, Forestry, Fishing and Hunting"
print("Mining, Quarrying, and Oil and Gas Extraction")
## [1] "Mining, Quarrying, and Oil and Gas Extraction"
print("Utilities")
## [1] "Utilities"
print("Construction")
## [1] "Construction"
print("Manufacturing")
## [1] "Manufacturing"
print("Wholesale Trade")
## [1] "Wholesale Trade"
print("Retail Trade")
## [1] "Retail Trade"
print("Transportation and Warehousing")
## [1] "Transportation and Warehousing"
print("Information")
## [1] "Information"
print("Finance and Insurance")
## [1] "Finance and Insurance"
We repeated print() 10 times, with minor modifications each time. Lots of repetition, which we seek to avoid if possible.
Now let’s look at how we might do that a little more efficiently with a “for loop.”
First let’s make a list of sectors, and save it as an object called “list_of_sectors.” The c() function tells R that we’re making a list.
Task Run the following code to make list of sectors. Briefly describe the output that appears below the codeblock. Answer Nothing appears below the code block. The list of sectors goes into the environment.
list_of_sectors <- c("Agriculture, Forestry, Fishing and Hunting", "Mining, Quarrying, and Oil and Gas Extraction", "Utilities", "Construction", "Manufacturing",
"Wholesale Trade", "Retail Trade", "Transportation and Warehousing", "Information", "Finance and Insurance")
And now let’s write a “for loop” to print out sector on that list.
Task Run the following code to use a for loop to print list of sectors. Briefly describe the output that appears below the codeblock. Answer It shows the print list of sectors.
# Define list of sectors
list_of_sectors <- c("Agriculture, Forestry, Fishing and Hunting", "Mining, Quarrying, and Oil and Gas Extraction", "Utilities", "Construction", "Manufacturing",
"Wholesale Trade", "Retail Trade", "Transportation and Warehousing", "Information", "Finance and Insurance")
# Make a for loop and run it
for (sector in list_of_sectors) {
print(sector)
}
## [1] "Agriculture, Forestry, Fishing and Hunting"
## [1] "Mining, Quarrying, and Oil and Gas Extraction"
## [1] "Utilities"
## [1] "Construction"
## [1] "Manufacturing"
## [1] "Wholesale Trade"
## [1] "Retail Trade"
## [1] "Transportation and Warehousing"
## [1] "Information"
## [1] "Finance and Insurance"
That’s many fewer lines of code. Let’s break down what we just saw, starting with for (sector in list_of sectors).
The information inside the parentheses tells R what list to use – list_of_sectors – and how to identify list elements later on – sector.
It’s important that the thing on the right side of “in” use the exact name of the list we want to loop through – in this case “list_of_sectors”.
If we try to feed it something different – say “sector_list” – it won’t work, because our actual list is called something else – “list_of_sectors”. This code throws an error.
Task Run the following code to use a for loop to print list of sectors that errors. Briefly describe the output that appears below the codeblock. Answer It shows an error. The object ‘sector-list’ wasn’t found.
# Define list of sectors
list_of_sectors <- c("Agriculture, Forestry, Fishing and Hunting", "Mining, Quarrying, and Oil and Gas Extraction", "Utilities", "Construction", "Manufacturing",
"Wholesale Trade", "Retail Trade", "Transportation and Warehousing", "Information", "Finance and Insurance")
# For loop that refers to a list that doesn't exist!
for (sector in sector_list) {
print(sector)
}
## Error in eval(expr, envir, enclos): object 'sector_list' not found
The name on the left side of “in” – the word we’re assigning to represent each element – is totally arbitrary.
We could use any character string, even something simple like “x”.
What matters is that we use the same character string inside of the curly braces {}, the section of the “for loop” that tells R what to do to each element – in this case, print it out.
To illustrate this, note that the code works just fine if we change it to say this:
Task Run the following code to use a for loop to print list of sectors. Briefly describe the output that appears below the codeblock. Answer The code printed the list of sectors.
# Define list of sectors
list_of_sectors <- c("Agriculture, Forestry, Fishing and Hunting", "Mining, Quarrying, and Oil and Gas Extraction", "Utilities", "Construction", "Manufacturing",
"Wholesale Trade", "Retail Trade", "Transportation and Warehousing", "Information", "Finance and Insurance")
# For loop with x that stands in for each element in our list, instead of sector
for (x in list_of_sectors) {
print(x)
}
## [1] "Agriculture, Forestry, Fishing and Hunting"
## [1] "Mining, Quarrying, and Oil and Gas Extraction"
## [1] "Utilities"
## [1] "Construction"
## [1] "Manufacturing"
## [1] "Wholesale Trade"
## [1] "Retail Trade"
## [1] "Transportation and Warehousing"
## [1] "Information"
## [1] "Finance and Insurance"
But it does NOT work if we call each element one thing – x – in the first line of our “for loop”, and use a different name to refer to it inside of the curly braces.
In this code below, it has no idea what we mean by “sector_name”, because we haven’t defined that anywhere.
Task Run the following code to use a for loop to print list of sectors. Briefly describe the output that appears below the codeblock. Answer It shows an error, as the object ‘sector name’ cannot be found.
# Define list of sectors
list_of_sectors <- c("Agriculture, Forestry, Fishing and Hunting", "Mining, Quarrying, and Oil and Gas Extraction", "Utilities", "Construction", "Manufacturing",
"Wholesale Trade", "Retail Trade", "Transportation and Warehousing", "Information", "Finance and Insurance")
# For loop that includes instructions that refer to a variable that doesn't exist.
for (x in list_of_sectors) {
print(sector_name)
}
## Error in print(sector_name): object 'sector_name' not found
We can also write for loops to iterate over a range of numbers, instead of a list of characters. The syntax is a little different.
The code below says: “for each number in a range of numbers from 1 to 10, print the number.”
Task Run the following code to use a for loop to print 1 to 10. Briefly describe the output that appears below the codeblock. Answer The code printed the numbers 1-10.
# For loop that includes instructions that refer to a variable that doesn't exist.
for (number in 1:10) {
print(number)
}
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
## [1] 6
## [1] 7
## [1] 8
## [1] 9
## [1] 10
Here’s a minor variation on that approach that we’ll make use of below.
Instead of giving the for loop an explicit number range, like 1:10, we can tell it to use 1 to “the number of rows in a dataframe” as our list of things to loop through.
Remember the naics_industry dataframe we loaded first? It has 20 rows. ### Task 21: Run code to display naics_industry Task Run the following code to display naics_industry. Briefly describe the output that appears below the codeblock. Answer It shows the naics_industry table.
naics_industry
## # A tibble: 20 × 2
## sector description
## <chr> <chr>
## 1 11 Agriculture, Forestry, Fishing and Hunting
## 2 21 Mining, Quarrying, and Oil and Gas Extraction
## 3 22 Utilities
## 4 23 Construction
## 5 31-33 Manufacturing
## 6 42 Wholesale Trade
## 7 44-45 Retail Trade
## 8 48-49 Transportation and Warehousing
## 9 51 Information
## 10 52 Finance and Insurance
## 11 53 Real Estate and Rental and Leasing
## 12 54 Professional, Scientific, and Technical Services
## 13 55 Management of Companies and Enterprises
## 14 56 Administrative and Support and Waste Management and Remediation Servi…
## 15 61 Educational Services
## 16 62 Health Care and Social Assistance
## 17 71 Arts, Entertainment, and Recreation
## 18 72 Accommodation and Food Services
## 19 81 Other Services (except Public Administration)
## 20 92 Public Administration
We can use that information in our for loop by using the nrow() function, which calculates the number of rows in a dataframe. Here’s a quick demonstration of how that works.
Task Run the following code to display number of rows naics_industry. Briefly describe the output that appears below the codeblock. Answer The code shows that there are 20 rows in naics_industry.
nrow(naics_industry)
## [1] 20
To put it all together, the code below says “make a list of numbers that starts at 1 and ends at the number of rows in the naics_industry dataframe (which is 20), then print out each of these numbers.”
Task Run the following code to to print each row number in naics_industry. Briefly describe the output that appears below the codeblock. Answer The output shows a rprint of the list of 20 rows in naics_industry,
# For loop that includes instructions that refer to a variable that doesn't exist.
for (row_number in 1:nrow(naics_industry)) {
print(row_number)
}
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
## [1] 6
## [1] 7
## [1] 8
## [1] 9
## [1] 10
## [1] 11
## [1] 12
## [1] 13
## [1] 14
## [1] 15
## [1] 16
## [1] 17
## [1] 18
## [1] 19
## [1] 20
These were basic examples of how “for loops” work. Next, we’ll learn to apply “for loops” to efficentily extract information from multiple web pages.
First, let’s look at the codeblock we wrote earlier to extract the number of employees in the mining sector.
Task Run the following code to load table from earlier. Briefly describe the output that appears below the codeblock. Answer It shows the number of employees in the mining sector in June 2021.
# Define url of the page we want to get
url <- "https://www.bls.gov/iag/tgs/iag21.htm"
# Get employment html page and select only the table with employment information, then transform it from html to a table.
employment_info <- url %>%
read_html() %>%
html_element(xpath = '//*[@id="iag21emp1"]') %>%
html_table()
# Keep only second row with seasonally adjusted, bind back to each_row_df
employment_info <- employment_info %>%
clean_names() %>%
slice(2) %>%
select(data_series, jun_2021)
# Display it so we can see what it looks like
employment_info
## # A tibble: 1 × 2
## data_series jun_2021
## <chr> <chr>
## 1 Employment, all employees (seasonally adjusted) 586.7
This contains all the steps we needed to extract the information from one sector page. We’re now going to modify this function so we can use it to extract information from each sector page, writing code that keeps us from repeating ourselves too much.
First, we need to build a list of URLs to loop through in a “for loop.” We can do that using the dataframe we made in the last chapter. ### Task 25: Run code to load the table from the last chapter Task Run the following code to load table from last chapter. Briefly describe the output that appears below the codeblock. Answer It shows the sector table from the last chapter.
# Define url of page we want to scrape
naics_url <- "https://www.bls.gov/ces/naics/"
# Read in all html from table, store all tables on page as nested list of dataframes.
naics_industry <- naics_url %>%
read_html() %>%
html_table()
# Just keep the second dataframe in our list, standardize column headers, remove last row
naics_industry <- naics_industry[[2]] %>%
clean_names() %>%
slice(-21)
# show the dataframe
naics_industry
## # A tibble: 20 × 2
## sector description
## <chr> <chr>
## 1 11 Agriculture, Forestry, Fishing and Hunting
## 2 21 Mining, Quarrying, and Oil and Gas Extraction
## 3 22 Utilities
## 4 23 Construction
## 5 31-33 Manufacturing
## 6 42 Wholesale Trade
## 7 44-45 Retail Trade
## 8 48-49 Transportation and Warehousing
## 9 51 Information
## 10 52 Finance and Insurance
## 11 53 Real Estate and Rental and Leasing
## 12 54 Professional, Scientific, and Technical Services
## 13 55 Management of Companies and Enterprises
## 14 56 Administrative and Support and Waste Management and Remediation Servi…
## 15 61 Educational Services
## 16 62 Health Care and Social Assistance
## 17 71 Arts, Entertainment, and Recreation
## 18 72 Accommodation and Food Services
## 19 81 Other Services (except Public Administration)
## 20 92 Public Administration
This gives us the sector code and name for each industry.
Now let’s have a look at the URLs for a few of the pages we want to grab data from.
Notice a pattern?
They all start with “https://www.bls.gov/iag/tgs/iag”. The next bit of information is different for each one; with the two-digit sector code for each sector. The remainder is identical in all three links, “.htm”.
Because they’re all the same, we can use the information in the dataframe we just loaded to make all the URLs we need.
We’re going to use mutate() and paste0() to concatenate (mash together) the things that stay constant in every url (the beginning and end) with the things that are different (the sector number, stored in the column called sector).
Task Run the following code to build url. Briefly describe the output that appears below the codeblock. Answer It shows us the same table, but with an added sector_url column.
# Make a column with URL for each sector.
naics_industry <- naics_industry %>%
mutate(sector_url = paste0("https://www.bls.gov/iag/tgs/iag",sector,".htm"))
# Display it
naics_industry
## # A tibble: 20 × 3
## sector description sector_url
## <chr> <chr> <chr>
## 1 11 Agriculture, Forestry, Fishing and Hunting https://www.bls.gov/iag/…
## 2 21 Mining, Quarrying, and Oil and Gas Extracti… https://www.bls.gov/iag/…
## 3 22 Utilities https://www.bls.gov/iag/…
## 4 23 Construction https://www.bls.gov/iag/…
## 5 31-33 Manufacturing https://www.bls.gov/iag/…
## 6 42 Wholesale Trade https://www.bls.gov/iag/…
## 7 44-45 Retail Trade https://www.bls.gov/iag/…
## 8 48-49 Transportation and Warehousing https://www.bls.gov/iag/…
## 9 51 Information https://www.bls.gov/iag/…
## 10 52 Finance and Insurance https://www.bls.gov/iag/…
## 11 53 Real Estate and Rental and Leasing https://www.bls.gov/iag/…
## 12 54 Professional, Scientific, and Technical Ser… https://www.bls.gov/iag/…
## 13 55 Management of Companies and Enterprises https://www.bls.gov/iag/…
## 14 56 Administrative and Support and Waste Manage… https://www.bls.gov/iag/…
## 15 61 Educational Services https://www.bls.gov/iag/…
## 16 62 Health Care and Social Assistance https://www.bls.gov/iag/…
## 17 71 Arts, Entertainment, and Recreation https://www.bls.gov/iag/…
## 18 72 Accommodation and Food Services https://www.bls.gov/iag/…
## 19 81 Other Services (except Public Administratio… https://www.bls.gov/iag/…
## 20 92 Public Administration https://www.bls.gov/iag/…
While we’re at it, we’re going to use the same method to programatically build the “xpath” for the table on each sector page.
Recall that when we wrote our function that got information from just the mining page, the xpath targeted an element with an ID of “iag21emp1”. Why 21? That’s the sector code for mining.
If we look for that exact element ID on other sector pages, we won’t find it! That’s because it’s different for each page.
On the Utilities page (sector code 22), the ID for the table we want is “iag22emp1”. On the Construction page (sector code 23), it’s “iag23emp1”. We can also build this programatically, because it follows a predictable pattern.
Task Run the following code to build id. Briefly describe the output that appears below the codeblock. Answer Now we have a sector_xpath_id column in the same table as above.
# Make a column with URL and xpath ID for each sector
naics_industry <- naics_industry %>%
mutate(sector_url = paste0("https://www.bls.gov/iag/tgs/iag",sector,".htm")) %>%
mutate(sector_xpath_id =paste0("iag",sector,"emp1"))
# Display it
naics_industry
## # A tibble: 20 × 4
## sector description sector_url sector_xpath_id
## <chr> <chr> <chr> <chr>
## 1 11 Agriculture, Forestry, Fishing a… https://www.bls.gov… iag11emp1
## 2 21 Mining, Quarrying, and Oil and G… https://www.bls.gov… iag21emp1
## 3 22 Utilities https://www.bls.gov… iag22emp1
## 4 23 Construction https://www.bls.gov… iag23emp1
## 5 31-33 Manufacturing https://www.bls.gov… iag31-33emp1
## 6 42 Wholesale Trade https://www.bls.gov… iag42emp1
## 7 44-45 Retail Trade https://www.bls.gov… iag44-45emp1
## 8 48-49 Transportation and Warehousing https://www.bls.gov… iag48-49emp1
## 9 51 Information https://www.bls.gov… iag51emp1
## 10 52 Finance and Insurance https://www.bls.gov… iag52emp1
## 11 53 Real Estate and Rental and Leasi… https://www.bls.gov… iag53emp1
## 12 54 Professional, Scientific, and Te… https://www.bls.gov… iag54emp1
## 13 55 Management of Companies and Ente… https://www.bls.gov… iag55emp1
## 14 56 Administrative and Support and W… https://www.bls.gov… iag56emp1
## 15 61 Educational Services https://www.bls.gov… iag61emp1
## 16 62 Health Care and Social Assistance https://www.bls.gov… iag62emp1
## 17 71 Arts, Entertainment, and Recreat… https://www.bls.gov… iag71emp1
## 18 72 Accommodation and Food Services https://www.bls.gov… iag72emp1
## 19 81 Other Services (except Public Ad… https://www.bls.gov… iag81emp1
## 20 92 Public Administration https://www.bls.gov… iag92emp1
Lastly, we’re going to use filter to remove the “Public Administration” sector, because there’s no page for it. We’ll have to get that information some other way.
Task Run the following code to filter table. Briefly describe the output that appears below the codeblock. Answer The output shows the same table as above, just without the “Public Administration” sector.
# Make a column with URL and xpath ID for each sector, remove the Public Administration sector
naics_industry <- naics_industry %>%
mutate(sector_url = paste0("https://www.bls.gov/iag/tgs/iag",sector,".htm")) %>%
mutate(sector_xpath_id =paste0("iag",sector,"emp1")) %>%
filter(description != "Public Administration")
# Display it
naics_industry
## # A tibble: 19 × 4
## sector description sector_url sector_xpath_id
## <chr> <chr> <chr> <chr>
## 1 11 Agriculture, Forestry, Fishing a… https://www.bls.gov… iag11emp1
## 2 21 Mining, Quarrying, and Oil and G… https://www.bls.gov… iag21emp1
## 3 22 Utilities https://www.bls.gov… iag22emp1
## 4 23 Construction https://www.bls.gov… iag23emp1
## 5 31-33 Manufacturing https://www.bls.gov… iag31-33emp1
## 6 42 Wholesale Trade https://www.bls.gov… iag42emp1
## 7 44-45 Retail Trade https://www.bls.gov… iag44-45emp1
## 8 48-49 Transportation and Warehousing https://www.bls.gov… iag48-49emp1
## 9 51 Information https://www.bls.gov… iag51emp1
## 10 52 Finance and Insurance https://www.bls.gov… iag52emp1
## 11 53 Real Estate and Rental and Leasi… https://www.bls.gov… iag53emp1
## 12 54 Professional, Scientific, and Te… https://www.bls.gov… iag54emp1
## 13 55 Management of Companies and Ente… https://www.bls.gov… iag55emp1
## 14 56 Administrative and Support and W… https://www.bls.gov… iag56emp1
## 15 61 Educational Services https://www.bls.gov… iag61emp1
## 16 62 Health Care and Social Assistance https://www.bls.gov… iag62emp1
## 17 71 Arts, Entertainment, and Recreat… https://www.bls.gov… iag71emp1
## 18 72 Accommodation and Food Services https://www.bls.gov… iag72emp1
## 19 81 Other Services (except Public Ad… https://www.bls.gov… iag81emp1
We’re left with a dataframe of 19 rows and 4 columns. It now contains everything we need.
Next, we’ll construct a “for loop” to extract the info we need from each page. We’re going to build it up step-by-step, beginning with the the basic elements of our “for loop”.
The codeblock below says: “Make a list with the row numbers from 1 to the number of rows in our naics_industry dataframe (which is 19). Then, for each element of that list (1, 2, 3, 4, 5 and so on up to 19), use slice() to keep only the one row that matches that number and save this newly created dataframe as each_row_df. Print out the dataframe. Then go to the next element on the list and do the same thing. Keep doing that until we hit number 19, then stop.”
We get 19 dataframes, each with one row, one for each sector.
Task Run the following for loop and keep one row. Briefly describe the output that appears below the codeblock. Answer The output shows tables for each row that matches its respective number.
# For loop, iterating over each row in our naics industry dataframe
for(row_number in 1:nrow(naics_industry)) {
# Keep only the row for a given row number, get rid of every other row
each_row_df <- naics_industry %>%
slice(row_number)
# To help us see what's happening as we build this, we're going to print the thing we're creating.
print(each_row_df)
}
## # A tibble: 1 × 4
## sector description sector_url sector_xpath_id
## <chr> <chr> <chr> <chr>
## 1 11 Agriculture, Forestry, Fishing and Hunting https://www… iag11emp1
## # A tibble: 1 × 4
## sector description sector_url sector_xpath_id
## <chr> <chr> <chr> <chr>
## 1 21 Mining, Quarrying, and Oil and Gas Extraction https://w… iag21emp1
## # A tibble: 1 × 4
## sector description sector_url sector_xpath_id
## <chr> <chr> <chr> <chr>
## 1 22 Utilities https://www.bls.gov/iag/tgs/iag22.htm iag22emp1
## # A tibble: 1 × 4
## sector description sector_url sector_xpath_id
## <chr> <chr> <chr> <chr>
## 1 23 Construction https://www.bls.gov/iag/tgs/iag23.htm iag23emp1
## # A tibble: 1 × 4
## sector description sector_url sector_xpath_id
## <chr> <chr> <chr> <chr>
## 1 31-33 Manufacturing https://www.bls.gov/iag/tgs/iag31-33.htm iag31-33emp1
## # A tibble: 1 × 4
## sector description sector_url sector_xpath_id
## <chr> <chr> <chr> <chr>
## 1 42 Wholesale Trade https://www.bls.gov/iag/tgs/iag42.htm iag42emp1
## # A tibble: 1 × 4
## sector description sector_url sector_xpath_id
## <chr> <chr> <chr> <chr>
## 1 44-45 Retail Trade https://www.bls.gov/iag/tgs/iag44-45.htm iag44-45emp1
## # A tibble: 1 × 4
## sector description sector_url sector_xpath_id
## <chr> <chr> <chr> <chr>
## 1 48-49 Transportation and Warehousing https://www.bls.gov/iag… iag48-49emp1
## # A tibble: 1 × 4
## sector description sector_url sector_xpath_id
## <chr> <chr> <chr> <chr>
## 1 51 Information https://www.bls.gov/iag/tgs/iag51.htm iag51emp1
## # A tibble: 1 × 4
## sector description sector_url sector_xpath_id
## <chr> <chr> <chr> <chr>
## 1 52 Finance and Insurance https://www.bls.gov/iag/tgs/iag52.htm iag52emp1
## # A tibble: 1 × 4
## sector description sector_url sector_xpath_id
## <chr> <chr> <chr> <chr>
## 1 53 Real Estate and Rental and Leasing https://www.bls.gov… iag53emp1
## # A tibble: 1 × 4
## sector description sector_url sector_xpath_id
## <chr> <chr> <chr> <chr>
## 1 54 Professional, Scientific, and Technical Services https://w… iag54emp1
## # A tibble: 1 × 4
## sector description sector_url sector_xpath_id
## <chr> <chr> <chr> <chr>
## 1 55 Management of Companies and Enterprises https://www.bl… iag55emp1
## # A tibble: 1 × 4
## sector description sector_url sector_xpath_id
## <chr> <chr> <chr> <chr>
## 1 56 Administrative and Support and Wa… https://www.bls.gov… iag56emp1
## # A tibble: 1 × 4
## sector description sector_url sector_xpath_id
## <chr> <chr> <chr> <chr>
## 1 61 Educational Services https://www.bls.gov/iag/tgs/iag61.htm iag61emp1
## # A tibble: 1 × 4
## sector description sector_url sector_xpath_id
## <chr> <chr> <chr> <chr>
## 1 62 Health Care and Social Assistance https://www.bls.gov/… iag62emp1
## # A tibble: 1 × 4
## sector description sector_url sector_xpath_id
## <chr> <chr> <chr> <chr>
## 1 71 Arts, Entertainment, and Recreation https://www.bls.go… iag71emp1
## # A tibble: 1 × 4
## sector description sector_url sector_xpath_id
## <chr> <chr> <chr> <chr>
## 1 72 Accommodation and Food Services https://www.bls.gov/ia… iag72emp1
## # A tibble: 1 × 4
## sector description sector_url sector_xpath_id
## <chr> <chr> <chr> <chr>
## 1 81 Other Services (except Public Administration) https://w… iag81emp1
We’re almost to the part where we can go out and fetch the html we need. Before we do that, let’s store as part of our loop an object called “url”, which contains the URL of the page for each sector.
The syntax with the dollar sign is a little funky, but “each_row_df$sector_url” says “from the each_row_df dataframe, grab the information in the sector_url column.” Because the column has only one row, there’s one value.
We’re going to do something simliar with the xpath for our employment table by using the information in the sector_xpath_id column.
That code also looks a little unwieldly. Recall that the xpath for the mining industry was '//*[@id="iag22emp1"]'.
In the code below, we’re building the xpath dynamically by pasting together the parts that stay the same for each xpath – '//*[@id="' and '"]' – and the parts that change for each sector, pulled from the xpath_sector_id column.
To see how this is working, we’re going to edit our print statement at the end a bit, printing the row_number and the dynamically created url and xpath.
Task Run the following for loop to store url and xpath value. Briefly describe the output that appears below the codeblock. Answer The output stored url and xpath value in the environment. It printed the row number and the url and xpath.
# For loop, iterating over each row in our naics industry dataframe
for(row_number in 1:nrow(naics_industry)) {
# Keep only the row for a given row number, get rid of every other row
each_row_df <- naics_industry %>%
slice(row_number)
# Define url of page to get
url <- each_row_df$sector_url
# Define id of table to ingest
xpath_employment_table <- paste0('//*[@id="',each_row_df$sector_xpath_id,'"]')
# To help us see what's happening as we build this, we're going to print the thing we're creating.
print(paste0("ROW NUMBER:", row_number," URL: ",url," XPATH:",xpath_employment_table))
}
## [1] "ROW NUMBER:1 URL: https://www.bls.gov/iag/tgs/iag11.htm XPATH://*[@id=\"iag11emp1\"]"
## [1] "ROW NUMBER:2 URL: https://www.bls.gov/iag/tgs/iag21.htm XPATH://*[@id=\"iag21emp1\"]"
## [1] "ROW NUMBER:3 URL: https://www.bls.gov/iag/tgs/iag22.htm XPATH://*[@id=\"iag22emp1\"]"
## [1] "ROW NUMBER:4 URL: https://www.bls.gov/iag/tgs/iag23.htm XPATH://*[@id=\"iag23emp1\"]"
## [1] "ROW NUMBER:5 URL: https://www.bls.gov/iag/tgs/iag31-33.htm XPATH://*[@id=\"iag31-33emp1\"]"
## [1] "ROW NUMBER:6 URL: https://www.bls.gov/iag/tgs/iag42.htm XPATH://*[@id=\"iag42emp1\"]"
## [1] "ROW NUMBER:7 URL: https://www.bls.gov/iag/tgs/iag44-45.htm XPATH://*[@id=\"iag44-45emp1\"]"
## [1] "ROW NUMBER:8 URL: https://www.bls.gov/iag/tgs/iag48-49.htm XPATH://*[@id=\"iag48-49emp1\"]"
## [1] "ROW NUMBER:9 URL: https://www.bls.gov/iag/tgs/iag51.htm XPATH://*[@id=\"iag51emp1\"]"
## [1] "ROW NUMBER:10 URL: https://www.bls.gov/iag/tgs/iag52.htm XPATH://*[@id=\"iag52emp1\"]"
## [1] "ROW NUMBER:11 URL: https://www.bls.gov/iag/tgs/iag53.htm XPATH://*[@id=\"iag53emp1\"]"
## [1] "ROW NUMBER:12 URL: https://www.bls.gov/iag/tgs/iag54.htm XPATH://*[@id=\"iag54emp1\"]"
## [1] "ROW NUMBER:13 URL: https://www.bls.gov/iag/tgs/iag55.htm XPATH://*[@id=\"iag55emp1\"]"
## [1] "ROW NUMBER:14 URL: https://www.bls.gov/iag/tgs/iag56.htm XPATH://*[@id=\"iag56emp1\"]"
## [1] "ROW NUMBER:15 URL: https://www.bls.gov/iag/tgs/iag61.htm XPATH://*[@id=\"iag61emp1\"]"
## [1] "ROW NUMBER:16 URL: https://www.bls.gov/iag/tgs/iag62.htm XPATH://*[@id=\"iag62emp1\"]"
## [1] "ROW NUMBER:17 URL: https://www.bls.gov/iag/tgs/iag71.htm XPATH://*[@id=\"iag71emp1\"]"
## [1] "ROW NUMBER:18 URL: https://www.bls.gov/iag/tgs/iag72.htm XPATH://*[@id=\"iag72emp1\"]"
## [1] "ROW NUMBER:19 URL: https://www.bls.gov/iag/tgs/iag81.htm XPATH://*[@id=\"iag81emp1\"]"
Armed with the URL and xpath for each sector web page, we can now go out and get the employment table for each sector.
We’ll read in the html from the url we just stored; extract the table that has the xpath ID we just created; and then transform the html table code into a proper dataframe.
The dataframe is hidden inside a nested list, which we’ll have to extract in the next step.
So, when you run this code, it will print out 19 dataframes inside of nested lists, each containing one dataframe.
Task Run the following for loop to get tables. Briefly describe the output that appears below the codeblock. Answer The output shows 19 tables and it also printed the data series table.
# For loop, iterating over each row in our naics industry dataframe
for(row_number in 1:nrow(naics_industry)) {
# Keep only the row for a given row number, get rid of every other row
each_row_df <- naics_industry %>%
slice(row_number)
# Define url of page to get
url <- each_row_df$sector_url
# Define id of table to ingest
xpath_employment_table <- paste0('//*[@id="',each_row_df$sector_xpath_id,'"]')
# Get employment table from each page by going to each url defined above, reading in the html with read_html(), extracting the table with the id generated by the xpath code using html_elements), and then turning the html into a proper dataframe using html_table(). The dataframe is in a nested list, which we'll have to extract in the next step.
employment_info <- url %>%
read_html() %>%
html_elements(xpath = xpath_employment_table) %>%
html_table()
# To help us see what's happening as we build this, we're going to print the thing we're creating.
print(employment_info)
}
## [[1]]
## # A tibble: 2 × 6
## `Data series` Backdata Jun.2021 Jul.2021 Aug.2021 Sep.2021
## <chr> <lgl> <chr> <chr> <chr> <chr>
## 1 Unemployment NA "" "" "" ""
## 2 Unemployment rate NA "7.5%" "7.6%" "5.7%" "5.0%"
##
## [[1]]
## # A tibble: 6 × 6
## `Data series` Backdata Jun.2021 Jul.2021 Aug.2021 Sep.2021
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Employment (in tho… "" "" "" "" ""
## 2 Employment, all em… "" "586.7" "592.5" "(p)598.3" "(p)603.0"
## 3 Employment, produc… "" "427.6" "433.6" "(p)438.8" ""
## 4 Unemployment "" "" "" "" ""
## 5 Unemployment rate "" "10.3%" "8.9%" "10.2%" "7.3%"
## 6 Footnotes(p) Preli… "Footnotes… "Footnotes… "Footnotes… "Footnotes… "Footnote…
##
## [[1]]
## # A tibble: 6 × 6
## `Data series` Backdata Jun.2021 Jul.2021 Aug.2021 Sep.2021
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Employment (in thousa… "" "" "" "" ""
## 2 Employment, all emplo… "" "538.6" "538.9" "(p)538.1" "(p)537.6"
## 3 Employment, productio… "" "428.1" "429.1" "(p)427.4" "(p)427.1"
## 4 Unemployment "" "" "" "" ""
## 5 Unemployment rate "" "5.1%" "3.4%" "2.7%" "3.1%"
## 6 Footnotes(p) Prelimin… "Footnotes… "Footnote… "Footnote… "Footnote… "Footnote…
##
## [[1]]
## # A tibble: 10 × 6
## `Data series` Backdata Jun.2021 Jul.2021 Aug.2021 Sep.2021
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Employment (in thous… "" "" "" "" ""
## 2 Employment, all empl… "" "7,413" "7,425" "(p)7,425" "(p)7,447"
## 3 Employment, producti… "" "5,494" "5,517" "(p)5,510" "(p)5,518"
## 4 Unemployment "" "" "" "" ""
## 5 Unemployment rate "" "7.5%" "6.1%" "4.6%" "4.5%"
## 6 Job openings, hires,… "" "" "" "" ""
## 7 Job openings "" "344" "377" "(p)365" ""
## 8 Hires "" "441" "430" "(p)381" ""
## 9 Separations "" "352" "402" "(p)431" ""
## 10 Footnotes(p) Prelimi… "Footnotes… "Footnote… "Footnote… "Footnote… "Footnote…
##
## [[1]]
## # A tibble: 10 × 6
## `Data series` Backdata Jun.2021 Jul.2021 Aug.2021 Sep.2021
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Employment (in thous… "" "" "" "" ""
## 2 Employment, all empl… "" "12,332" "12,389" "(p)12,42… "(p)12,44…
## 3 Employment, producti… "" "8,558" "8,617" "(p)8,641" "(p)8,655"
## 4 Unemployment "" "" "" "" ""
## 5 Unemployment rate "" "5.4%" "4.2%" "3.6%" "3.9%"
## 6 Job openings, hires,… "" "" "" "" ""
## 7 Job openings "" "937" "974" "(p)892" ""
## 8 Hires "" "554" "526" "(p)515" ""
## 9 Separations "" "426" "452" "(p)518" ""
## 10 Footnotes(p) Prelimi… "Footnotes… "Footnote… "Footnote… "Footnote… "Footnote…
##
## [[1]]
## # A tibble: 10 × 6
## `Data series` Backdata Jun.2021 Jul.2021 Aug.2021 Sep.2021
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Employment (in thous… "" "" "" "" ""
## 2 Employment, all empl… "" "5,707.1" "5,722.1" "(p)5,718… "(p)5,735…
## 3 Employment, producti… "" "4,534.6" "4,547.8" "(p)4,549… "(p)4,568…
## 4 Unemployment "" "" "" "" ""
## 5 Unemployment rate "" "4.5%" "4.6%" "4.3%" "3.7%"
## 6 Job openings, hires,… "" "" "" "" ""
## 7 Job openings "" "287" "291" "(p)263" ""
## 8 Hires "" "195" "199" "(p)201" ""
## 9 Separations "" "142" "177" "(p)215" ""
## 10 Footnotes(p) Prelimi… "Footnotes… "Footnote… "Footnote… "Footnote… "Footnote…
##
## [[1]]
## # A tibble: 10 × 6
## `Data series` Backdata Jun.2021 Jul.2021 Aug.2021 Sep.2021
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Employment (in thous… "" "" "" "" ""
## 2 Employment, all empl… "" "15,361.4" "15,355.3" "(p)15,35… "(p)15,40…
## 3 Employment, producti… "" "13,057.1" "13,068.6" "(p)13,07… "(p)13,11…
## 4 Unemployment "" "" "" "" ""
## 5 Unemployment rate "" "6.6%" "6.4%" "6.5%" "6.1%"
## 6 Job openings, hires,… "" "" "" "" ""
## 7 Job openings "" "1,228" "1,245" "(p)1,324" ""
## 8 Hires "" "1,171" "905" "(p)987" ""
## 9 Separations "" "861" "913" "(p)1,097" ""
## 10 Footnotes(p) Prelimi… "Footnotes… "Footnote… "Footnote… "Footnote… "Footnote…
##
## [[1]]
## # A tibble: 6 × 6
## `Data series` Backdata Jun.2021 Jul.2021 Aug.2021 Sep.2021
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Employment (in thousa… "" "" "" "" ""
## 2 Employment, all emplo… "" "5,736.8" "5,792.9" "(p)5,847… "(p)5,894…
## 3 Employment, productio… "" "5,022.0" "5,052.6" "(p)5,074… "(p)5,102…
## 4 Unemployment "" "" "" "" ""
## 5 Unemployment rate "" "6.2%" "7.3%" "6.4%" "5.7%"
## 6 Footnotes(p) Prelimin… "Footnotes… "Footnote… "Footnote… "Footnote… "Footnote…
##
## [[1]]
## # A tibble: 10 × 6
## `Data series` Backdata Jun.2021 Jul.2021 Aug.2021 Sep.2021
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Employment (in thous… "" "" "" "" ""
## 2 Employment, all empl… "" "2,726" "2,745" "(p)2,774" "(p)2,806"
## 3 Employment, producti… "" "2,161" "2,183" "(p)2,201" "(p)2,233"
## 4 Unemployment "" "" "" "" ""
## 5 Unemployment rate "" "6.1%" "5.6%" "4.4%" "4.0%"
## 6 Job openings, hires,… "" "" "" "" ""
## 7 Job openings "" "152" "181" "(p)161" ""
## 8 Hires "" "96" "106" "(p)101" ""
## 9 Separations "" "93" "85" "(p)93" ""
## 10 Footnotes(p) Prelimi… "Footnotes… "Footnote… "Footnote… "Footnote… "Footnote…
##
## [[1]]
## # A tibble: 9 × 6
## `Data series` Backdata Jun.2021 Jul.2021 Aug.2021 Sep.2021
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Employment (in tho… "" "" "" "" ""
## 2 Employment, all em… "" "6,534.3" "6,545.1" "(p)6,546.… "(p)6,541…
## 3 Unemployment "" "" "" "" ""
## 4 Unemployment rate "" "2.8%" "3.1%" "3.5%" "2.0%"
## 5 Job openings, hire… "" "" "" "" ""
## 6 Job openings "" "229" "354" "(p)310" ""
## 7 Hires "" "156" "164" "(p)141" ""
## 8 Separations "" "140" "140" "(p)174" ""
## 9 Footnotes(p) Preli… "Footnotes… "Footnotes… "Footnotes… "Footnotes… "Footnote…
##
## [[1]]
## # A tibble: 9 × 6
## `Data series` Backdata Jun.2021 Jul.2021 Aug.2021 Sep.2021
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Employment (in tho… "" "" "" "" ""
## 2 Employment, all em… "" "2,271.8" "2,291.9" "(p)2,302.… "(p)2,308…
## 3 Unemployment "" "" "" "" ""
## 4 Unemployment rate "" "5.0%" "2.8%" "2.6%" "3.8%"
## 5 Job openings, hire… "" "" "" "" ""
## 6 Job openings "" "105" "199" "(p)167" ""
## 7 Hires "" "84" "89" "(p)74" ""
## 8 Separations "" "75" "82" "(p)71" ""
## 9 Footnotes(p) Preli… "Footnotes… "Footnotes… "Footnotes… "Footnotes… "Footnote…
##
## [[1]]
## # A tibble: 6 × 6
## `Data series` Backdata Jun.2021 Jul.2021 Aug.2021 Sep.2021
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Employment (in tho… "" "" "" "" ""
## 2 Employment, all em… "" "9,791.0" "9,839.0" "(p)9,897.… "(p)9,952…
## 3 Employment, produc… "" "7,513.7" "7,593.2" "(p)7,613.… ""
## 4 Unemployment "" "" "" "" ""
## 5 Unemployment rate "" "3.5%" "3.2%" "2.7%" "2.5%"
## 6 Footnotes(p) Preli… "Footnotes… "Footnotes… "Footnotes… "Footnotes… "Footnote…
##
## [[1]]
## # A tibble: 4 × 6
## `Data series` Backdata Jun.2021 Jul.2021 Aug.2021 Sep.2021
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Employment (in tho… "" "" "" "" ""
## 2 Employment, all em… "" "2,329.9" "2,329.8" "(p)2,336.… "(p)2,339…
## 3 Employment, produc… "" "1,533.4" "1,526.8" "(p)1,521.… ""
## 4 Footnotes(p) Preli… "Footnotes… "Footnotes… "Footnotes… "Footnotes… "Footnote…
##
## [[1]]
## # A tibble: 4 × 6
## `Data series` Backdata Jun.2021 Jul.2021 Aug.2021 Sep.2021
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Employment (in tho… "" "" "" "" ""
## 2 Employment, all em… "" "8,726.6" "8,770.3" "(p)8,790.… "(p)8,791…
## 3 Employment, produc… "" "7,732.8" "7,755.2" "(p)7,847.… ""
## 4 Footnotes(p) Preli… "Footnotes… "Footnotes… "Footnotes… "Footnotes… "Footnote…
##
## [[1]]
## # A tibble: 9 × 6
## `Data series` Backdata Jun.2021 Jul.2021 Aug.2021 Sep.2021
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Employment (in tho… "" "" "" "" ""
## 2 Employment, all em… "" "3,535.8" "3,576.1" "(p)3,625.… "(p)3,606…
## 3 Unemployment "" "" "" "" ""
## 4 Unemployment rate "" "6.4%" "6.7%" "6.1%" "3.9%"
## 5 Job openings, hire… "" "" "" "" ""
## 6 Job openings "" "202" "221" "(p)184" ""
## 7 Hires "" "134" "143" "(p)209" ""
## 8 Separations "" "111" "88" "(p)127" ""
## 9 Footnotes(p) Preli… "Footnotes… "Footnotes… "Footnotes… "Footnotes… "Footnote…
##
## [[1]]
## # A tibble: 10 × 6
## `Data series` Backdata Jun.2021 Jul.2021 Aug.2021 Sep.2021
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Employment (in tho… "" "" "" "" ""
## 2 Employment, all em… "" "20,001.2" "20,044.8" "(p)20,04… "(p)20,05…
## 3 Employment, produc… "" "17,590.4" "17,594.0" "(p)17,61… ""
## 4 Unemployment "" "" "" "" ""
## 5 Unemployment rate "" "3.9%" "3.8%" "3.9%" "3.1%"
## 6 Job openings, hire… "" "" "" "" ""
## 7 Job openings "" "1,479" "1,813" "(p)1,541" ""
## 8 Hires "" "713" "763" "(p)763" ""
## 9 Separations "" "721" "728" "(p)720" ""
## 10 Footnotes(p) Preli… "Footnotes… "Footnotes… "Footnotes… "Footnote… "Footnote…
##
## [[1]]
## # A tibble: 10 × 6
## `Data series` Backdata Jun.2021 Jul.2021 Aug.2021 Sep.2021
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Employment (in tho… "" "" "" "" ""
## 2 Employment, all em… "" "2,037.6" "2,093.4" "(p)2,142… "(p)2,185…
## 3 Employment, produc… "" "1,927.0" "2,031.8" "(p)2,012… ""
## 4 Unemployment "" "" "" "" ""
## 5 Unemployment rate "" "10.3%" "8.3%" "7.7%" "7.2%"
## 6 Job openings, hire… "" "" "" "" ""
## 7 Job openings "" "223" "234" "(p)206" ""
## 8 Hires "" "303" "199" "(p)148" ""
## 9 Separations "" "99" "119" "(p)219" ""
## 10 Footnotes(p) Preli… "Footnotes… "Footnotes… "Footnotes… "Footnote… "Footnote…
##
## [[1]]
## # A tibble: 10 × 6
## `Data series` Backdata Jun.2021 Jul.2021 Aug.2021 Sep.2021
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Employment (in tho… "" "" "" "" ""
## 2 Employment, all em… "" "12,762.9" "13,115.3" "(p)13,10… "(p)13,13…
## 3 Employment, produc… "" "11,417.1" "11,759.2" "(p)11,73… ""
## 4 Unemployment "" "" "" "" ""
## 5 Unemployment rate "" "11.0%" "9.2%" "9.4%" "7.8%"
## 6 Job openings, hire… "" "" "" "" ""
## 7 Job openings "" "1,552" "1,793" "(p)1,541" ""
## 8 Hires "" "1,462" "1,443" "(p)1,141" ""
## 9 Separations "" "886" "1,005" "(p)1,280" ""
## 10 Footnotes(p) Preli… "Footnotes… "Footnotes… "Footnotes… "Footnote… "Footnote…
##
## [[1]]
## # A tibble: 10 × 6
## `Data series` Backdata Jun.2021 Jul.2021 Aug.2021 Sep.2021
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Employment (in thous… "" "" "" "" ""
## 2 Employment, all empl… "" "5,665" "5,709" "(p)5,743" "(p)5,727"
## 3 Employment, producti… "" "4,589" "4,642" "(p)4,676" "(p)4,657"
## 4 Unemployment "" "" "" "" ""
## 5 Unemployment rate "" "5.4%" "4.9%" "5.0%" "4.2%"
## 6 Job openings, hires,… "" "" "" "" ""
## 7 Job openings "" "431" "501" "(p)499" ""
## 8 Hires "" "375" "327" "(p)250" ""
## 9 Separations "" "202" "275" "(p)234" ""
## 10 Footnotes(p) Prelimi… "Footnotes… "Footnote… "Footnote… "Footnote… "Footnote…
In this next step, we use employment_info <- employment_info[[1]] to extract each dataframe from the nested list. Then we’ll tidy up the dataframe a bit. We’ll use the get rid of all the information we don’t need in the table, by using slice() to keep only the second row. We’ll also standardize the column names with clean_names().
Task Run the following for loop to clean up tables. Briefly describe the output that appears below the codeblock. Answer The output shows clean tables with the second row that we wanted.
# For loop, iterating over each row in our naics industry dataframe
for(row_number in 1:nrow(naics_industry)) {
# Keep only the row for a given row number, get rid of every other row
each_row_df <- naics_industry %>%
slice(row_number)
# Define url of page to get
url <- each_row_df$sector_url
# Define id of table to ingest
xpath_employment_table <- paste0('//*[@id="',each_row_df$sector_xpath_id,'"]')
# Get employment table from each page by going to each url defined above, reading in the html with read_html(), extracting the table with the id generated by the xpath code using html_elements), and then turning the html into a proper dataframe using html_table(). The dataframe is in a nested list, which we'll have to extract in the next step.
employment_info <- url %>%
read_html() %>%
html_elements(xpath = xpath_employment_table) %>%
html_table()
# Grab the dataframe out of the list (it's the first and only element inside the list); clean up the field names with clean_names(); use slice(2) to keep only the second row;
employment_info <- employment_info[[1]] %>%
clean_names() %>%
slice(2)
# To help us see what's happening as we build this, we're going to print the thing we're creating.
print(employment_info)
}
## # A tibble: 1 × 6
## data_series backdata jun_2021 jul_2021 aug_2021 sep_2021
## <chr> <lgl> <chr> <chr> <chr> <chr>
## 1 Unemployment rate NA 7.5% 7.6% 5.7% 5.0%
## # A tibble: 1 × 6
## data_series backdata jun_2021 jul_2021 aug_2021 sep_2021
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Employment, all employees (seas… "" 586.7 592.5 (p)598.3 (p)603.0
## # A tibble: 1 × 6
## data_series backdata jun_2021 jul_2021 aug_2021 sep_2021
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Employment, all employees (seas… "" 538.6 538.9 (p)538.1 (p)537.6
## # A tibble: 1 × 6
## data_series backdata jun_2021 jul_2021 aug_2021 sep_2021
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Employment, all employees (seas… "" 7,413 7,425 (p)7,425 (p)7,447
## # A tibble: 1 × 6
## data_series backdata jun_2021 jul_2021 aug_2021 sep_2021
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Employment, all employees (sea… "" 12,332 12,389 (p)12,420 (p)12,4…
## # A tibble: 1 × 6
## data_series backdata jun_2021 jul_2021 aug_2021 sep_2021
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Employment, all employees (se… "" 5,707.1 5,722.1 (p)5,718… (p)5,735…
## # A tibble: 1 × 6
## data_series backdata jun_2021 jul_2021 aug_2021 sep_2021
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Employment, all employees (s… "" 15,361.4 15,355.3 (p)15,351… (p)15,40…
## # A tibble: 1 × 6
## data_series backdata jun_2021 jul_2021 aug_2021 sep_2021
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Employment, all employees (se… "" 5,736.8 5,792.9 (p)5,847… (p)5,894…
## # A tibble: 1 × 6
## data_series backdata jun_2021 jul_2021 aug_2021 sep_2021
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Employment, all employees (seas… "" 2,726 2,745 (p)2,774 (p)2,806
## # A tibble: 1 × 6
## data_series backdata jun_2021 jul_2021 aug_2021 sep_2021
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Employment, all employees (se… "" 6,534.3 6,545.1 (p)6,546… (p)6,541…
## # A tibble: 1 × 6
## data_series backdata jun_2021 jul_2021 aug_2021 sep_2021
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Employment, all employees (se… "" 2,271.8 2,291.9 (p)2,302… (p)2,308…
## # A tibble: 1 × 6
## data_series backdata jun_2021 jul_2021 aug_2021 sep_2021
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Employment, all employees (se… "" 9,791.0 9,839.0 (p)9,897… (p)9,952…
## # A tibble: 1 × 6
## data_series backdata jun_2021 jul_2021 aug_2021 sep_2021
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Employment, all employees (se… "" 2,329.9 2,329.8 (p)2,336… (p)2,339…
## # A tibble: 1 × 6
## data_series backdata jun_2021 jul_2021 aug_2021 sep_2021
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Employment, all employees (se… "" 8,726.6 8,770.3 (p)8,790… (p)8,791…
## # A tibble: 1 × 6
## data_series backdata jun_2021 jul_2021 aug_2021 sep_2021
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Employment, all employees (se… "" 3,535.8 3,576.1 (p)3,625… (p)3,606…
## # A tibble: 1 × 6
## data_series backdata jun_2021 jul_2021 aug_2021 sep_2021
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Employment, all employees (s… "" 20,001.2 20,044.8 (p)20,046… (p)20,05…
## # A tibble: 1 × 6
## data_series backdata jun_2021 jul_2021 aug_2021 sep_2021
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Employment, all employees (se… "" 2,037.6 2,093.4 (p)2,142… (p)2,185…
## # A tibble: 1 × 6
## data_series backdata jun_2021 jul_2021 aug_2021 sep_2021
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Employment, all employees (s… "" 12,762.9 13,115.3 (p)13,104… (p)13,13…
## # A tibble: 1 × 6
## data_series backdata jun_2021 jul_2021 aug_2021 sep_2021
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Employment, all employees (seas… "" 5,665 5,709 (p)5,743 (p)5,727
We now have 19 dataframes, each containing one row each and two columns, one of which is the employment number for a given sector for jun_2021. But we’re missing information about what industry sector these employment numbers represent.
We can add that back in by using bind_cols() to reconnect the each_row_df, which contains the sector code and the sector name.
Task Run the following for loop to add in data. Briefly describe the output that appears below the codeblock. Answer The output shows now shows the data we were missing. We can now see what industry sector these employment numbers represent in our tables.
# For loop, iterating over each row in our naics industry dataframe
for(row_number in 1:nrow(naics_industry)) {
# Keep only the row for a given row number, get rid of every other row
each_row_df <- naics_industry %>%
slice(row_number)
# Define url of page to get
url <- each_row_df$sector_url
# Define id of table to ingest
xpath_employment_table <- paste0('//*[@id="',each_row_df$sector_xpath_id,'"]')
# Get employment table from each page by going to each url defined above, reading in the html with read_html(), extracting the table with the id generated by the xpath code using html_elements), and then turning the html into a proper dataframe using html_table(). The dataframe is in a nested list, which we'll have to extract in the next step.
employment_info <- url %>%
read_html() %>%
html_elements(xpath = xpath_employment_table) %>%
html_table()
# Grab the dataframe out of the list (it's the first and only element inside the list); clean up the field names with clean_names(); use slice(2) to keep only the second row; use bind_cols() to append the sector code and name to this table.
employment_info <- employment_info[[1]] %>%
clean_names() %>%
slice(2) %>%
bind_cols(each_row_df)
# To help us see what's happening as we build this, we're going to print the thing we're creating.
print(employment_info)
}
## # A tibble: 1 × 10
## data_series backdata jun_2021 jul_2021 aug_2021 sep_2021 sector description
## <chr> <lgl> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Unemployment rate NA 7.5% 7.6% 5.7% 5.0% 11 Agricultur…
## # … with 2 more variables: sector_url <chr>, sector_xpath_id <chr>
## # A tibble: 1 × 10
## data_series backdata jun_2021 jul_2021 aug_2021 sep_2021 sector description
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Employment, … "" 586.7 592.5 (p)598.3 (p)603.0 21 Mining, Qua…
## # … with 2 more variables: sector_url <chr>, sector_xpath_id <chr>
## # A tibble: 1 × 10
## data_series backdata jun_2021 jul_2021 aug_2021 sep_2021 sector description
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Employment, a… "" 538.6 538.9 (p)538.1 (p)537.6 22 Utilities
## # … with 2 more variables: sector_url <chr>, sector_xpath_id <chr>
## # A tibble: 1 × 10
## data_series backdata jun_2021 jul_2021 aug_2021 sep_2021 sector description
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Employment, a… "" 7,413 7,425 (p)7,425 (p)7,447 23 Constructi…
## # … with 2 more variables: sector_url <chr>, sector_xpath_id <chr>
## # A tibble: 1 × 10
## data_series backdata jun_2021 jul_2021 aug_2021 sep_2021 sector description
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Employment, a… "" 12,332 12,389 (p)12,4… (p)12,4… 31-33 Manufactur…
## # … with 2 more variables: sector_url <chr>, sector_xpath_id <chr>
## # A tibble: 1 × 10
## data_series backdata jun_2021 jul_2021 aug_2021 sep_2021 sector description
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Employment, a… "" 5,707.1 5,722.1 (p)5,71… (p)5,73… 42 Wholesale …
## # … with 2 more variables: sector_url <chr>, sector_xpath_id <chr>
## # A tibble: 1 × 10
## data_series backdata jun_2021 jul_2021 aug_2021 sep_2021 sector description
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Employment, a… "" 15,361.4 15,355.3 (p)15,3… (p)15,4… 44-45 Retail Tra…
## # … with 2 more variables: sector_url <chr>, sector_xpath_id <chr>
## # A tibble: 1 × 10
## data_series backdata jun_2021 jul_2021 aug_2021 sep_2021 sector description
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Employment, … "" 5,736.8 5,792.9 (p)5,847… (p)5,89… 48-49 Transporta…
## # … with 2 more variables: sector_url <chr>, sector_xpath_id <chr>
## # A tibble: 1 × 10
## data_series backdata jun_2021 jul_2021 aug_2021 sep_2021 sector description
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Employment, a… "" 2,726 2,745 (p)2,774 (p)2,806 51 Information
## # … with 2 more variables: sector_url <chr>, sector_xpath_id <chr>
## # A tibble: 1 × 10
## data_series backdata jun_2021 jul_2021 aug_2021 sep_2021 sector description
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Employment, a… "" 6,534.3 6,545.1 (p)6,54… (p)6,54… 52 Finance an…
## # … with 2 more variables: sector_url <chr>, sector_xpath_id <chr>
## # A tibble: 1 × 10
## data_series backdata jun_2021 jul_2021 aug_2021 sep_2021 sector description
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Employment, … "" 2,271.8 2,291.9 (p)2,30… (p)2,30… 53 Real Estate…
## # … with 2 more variables: sector_url <chr>, sector_xpath_id <chr>
## # A tibble: 1 × 10
## data_series backdata jun_2021 jul_2021 aug_2021 sep_2021 sector description
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Employment, … "" 9,791.0 9,839.0 (p)9,89… (p)9,95… 54 Professiona…
## # … with 2 more variables: sector_url <chr>, sector_xpath_id <chr>
## # A tibble: 1 × 10
## data_series backdata jun_2021 jul_2021 aug_2021 sep_2021 sector description
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Employment, … "" 2,329.9 2,329.8 (p)2,33… (p)2,33… 55 Management …
## # … with 2 more variables: sector_url <chr>, sector_xpath_id <chr>
## # A tibble: 1 × 10
## data_series backdata jun_2021 jul_2021 aug_2021 sep_2021 sector description
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Employment, … "" 8,726.6 8,770.3 (p)8,79… (p)8,79… 56 Administrat…
## # … with 2 more variables: sector_url <chr>, sector_xpath_id <chr>
## # A tibble: 1 × 10
## data_series backdata jun_2021 jul_2021 aug_2021 sep_2021 sector description
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Employment, a… "" 3,535.8 3,576.1 (p)3,62… (p)3,60… 61 Educationa…
## # … with 2 more variables: sector_url <chr>, sector_xpath_id <chr>
## # A tibble: 1 × 10
## data_series backdata jun_2021 jul_2021 aug_2021 sep_2021 sector description
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Employment, … "" 20,001.2 20,044.8 (p)20,0… (p)20,0… 62 Health Care…
## # … with 2 more variables: sector_url <chr>, sector_xpath_id <chr>
## # A tibble: 1 × 10
## data_series backdata jun_2021 jul_2021 aug_2021 sep_2021 sector description
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Employment, … "" 2,037.6 2,093.4 (p)2,14… (p)2,18… 71 Arts, Enter…
## # … with 2 more variables: sector_url <chr>, sector_xpath_id <chr>
## # A tibble: 1 × 10
## data_series backdata jun_2021 jul_2021 aug_2021 sep_2021 sector description
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Employment, … "" 12,762.9 13,115.3 (p)13,10… (p)13,1… 72 Accommodat…
## # … with 2 more variables: sector_url <chr>, sector_xpath_id <chr>
## # A tibble: 1 × 10
## data_series backdata jun_2021 jul_2021 aug_2021 sep_2021 sector description
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Employment, … "" 5,665 5,709 (p)5,743 (p)5,727 81 Other Servi…
## # … with 2 more variables: sector_url <chr>, sector_xpath_id <chr>
Then we’ll do a little bit of cleaning.
Let’s use parse_number() to remove the comma from the jun_2021 number and convert it from a character to number. We’ll use rename() to make the jun_2021 column name a little more descriptive. And then we’ll use select() to keep only the columns we want to keep – the sector number, the sector name, and the jun_2021 employment number.
Task Run the following for loop to clean up tables. Briefly describe the output that appears below the codeblock. Answer The clean tables are displayed. We have only the columns we wanted to keep.
# For loop, iterating over each row in our naics industry dataframe
for(row_number in 1:nrow(naics_industry)) {
# Keep only the row for a given row number, get rid of every other row
each_row_df <- naics_industry %>%
slice(row_number)
# Define url of page to get
url <- each_row_df$sector_url
# Define id of table to ingest
xpath_employment_table <- paste0('//*[@id="',each_row_df$sector_xpath_id,'"]')
# Get employment table from each page by going to each url defined above, reading in the html with read_html(), extracting the table with the id generated by the xpath code using html_elements), and then turning the html into a proper dataframe using html_table(). The dataframe is in a nested list, which we'll have to extract in the next step.
employment_info <- url %>%
read_html() %>%
html_elements(xpath = xpath_employment_table) %>%
html_table()
# Grab the dataframe out of the list (it's the first and only element inside the list); clean up the field names with clean_names(); use slice(2) to keep only the second row; use bind_cols() to append the sector code and name to this table; turn jun_2021 column into a proper number, and rename it. Then select only three columns we need.
employment_info <- employment_info[[1]] %>%
clean_names() %>%
slice(2) %>%
bind_cols(each_row_df) %>%
mutate(jun_2021 = parse_number(jun_2021)) %>%
rename(jun_2021_employees = jun_2021) %>%
select(sector,description,jun_2021_employees)
# To help us see what's happening as we build this, we're going to print the thing we're creating.
print(employment_info)
}
## # A tibble: 1 × 3
## sector description jun_2021_employees
## <chr> <chr> <dbl>
## 1 11 Agriculture, Forestry, Fishing and Hunting 7.5
## # A tibble: 1 × 3
## sector description jun_2021_employees
## <chr> <chr> <dbl>
## 1 21 Mining, Quarrying, and Oil and Gas Extraction 587.
## # A tibble: 1 × 3
## sector description jun_2021_employees
## <chr> <chr> <dbl>
## 1 22 Utilities 539.
## # A tibble: 1 × 3
## sector description jun_2021_employees
## <chr> <chr> <dbl>
## 1 23 Construction 7413
## # A tibble: 1 × 3
## sector description jun_2021_employees
## <chr> <chr> <dbl>
## 1 31-33 Manufacturing 12332
## # A tibble: 1 × 3
## sector description jun_2021_employees
## <chr> <chr> <dbl>
## 1 42 Wholesale Trade 5707.
## # A tibble: 1 × 3
## sector description jun_2021_employees
## <chr> <chr> <dbl>
## 1 44-45 Retail Trade 15361.
## # A tibble: 1 × 3
## sector description jun_2021_employees
## <chr> <chr> <dbl>
## 1 48-49 Transportation and Warehousing 5737.
## # A tibble: 1 × 3
## sector description jun_2021_employees
## <chr> <chr> <dbl>
## 1 51 Information 2726
## # A tibble: 1 × 3
## sector description jun_2021_employees
## <chr> <chr> <dbl>
## 1 52 Finance and Insurance 6534.
## # A tibble: 1 × 3
## sector description jun_2021_employees
## <chr> <chr> <dbl>
## 1 53 Real Estate and Rental and Leasing 2272.
## # A tibble: 1 × 3
## sector description jun_2021_employees
## <chr> <chr> <dbl>
## 1 54 Professional, Scientific, and Technical Services 9791
## # A tibble: 1 × 3
## sector description jun_2021_employees
## <chr> <chr> <dbl>
## 1 55 Management of Companies and Enterprises 2330.
## # A tibble: 1 × 3
## sector description jun_2021_employe…
## <chr> <chr> <dbl>
## 1 56 Administrative and Support and Waste Management and … 8727.
## # A tibble: 1 × 3
## sector description jun_2021_employees
## <chr> <chr> <dbl>
## 1 61 Educational Services 3536.
## # A tibble: 1 × 3
## sector description jun_2021_employees
## <chr> <chr> <dbl>
## 1 62 Health Care and Social Assistance 20001.
## # A tibble: 1 × 3
## sector description jun_2021_employees
## <chr> <chr> <dbl>
## 1 71 Arts, Entertainment, and Recreation 2038.
## # A tibble: 1 × 3
## sector description jun_2021_employees
## <chr> <chr> <dbl>
## 1 72 Accommodation and Food Services 12763.
## # A tibble: 1 × 3
## sector description jun_2021_employees
## <chr> <chr> <dbl>
## 1 81 Other Services (except Public Administration) 5665
We’re getting very close to the finished table we showed at the beginning.
But right now, each bit of sector information is separated between 19 different dataframes.
We want them in one dataframe.
We can fix this by creating an empty dataframe called “employment_by_sector_all” using tibble(), placing it before our “for loop”.
And inside our “for loop” at the end, we’ll bind each employment_info dataframe to the newly created empty dataframe.
Task Run the following for loop combine tables into a single table. Briefly describe the output that appears below the codeblock. Answer Now, everything is in one dataframe.
# Create an empty dataframe to hold results
employment_by_sector_all <- tibble()
# For loop, iterating over each row in our naics industry dataframe
for(row_number in 1:nrow(naics_industry)) {
# Keep only the row for a given row number, get rid of every other row
each_row_df <- naics_industry %>%
slice(row_number)
# Define url of page to get
url <- each_row_df$sector_url
# Define id of table to ingest
xpath_employment_table <- paste0('//*[@id="',each_row_df$sector_xpath_id,'"]')
# Get employment table from each page by going to each url defined above, reading in the html with read_html(), extracting the table with the id generated by the xpath code using html_elements), and then turning the html into a proper dataframe using html_table(). The dataframe is in a nested list, which we'll have to extract in the next step.
employment_info <- url %>%
read_html() %>%
html_elements(xpath = xpath_employment_table) %>%
html_table()
# Grab the dataframe out of the list (it's the first and only element inside the list); clean up the field names with clean_names(); use slice(2) to keep only the second row; use bind_cols() to append the sector code and name to this table; turn jun_2021 column into a proper number, and rename it. Then select only three columns we need.
employment_info <- employment_info[[1]] %>%
clean_names() %>%
slice(2) %>%
bind_cols(each_row_df) %>%
mutate(jun_2021 = parse_number(jun_2021)) %>%
rename(jun_2021_employees = jun_2021) %>%
select(sector,description,jun_2021_employees)
# Bind each individual employment info table to our employment_by_sector_all dataframe
employment_by_sector_all <- employment_by_sector_all %>%
bind_rows(employment_info)
}
# Display the completed dataframe
employment_by_sector_all
## # A tibble: 19 × 3
## sector description jun_2021_employe…
## <chr> <chr> <dbl>
## 1 11 Agriculture, Forestry, Fishing and Hunting 7.5
## 2 21 Mining, Quarrying, and Oil and Gas Extraction 587.
## 3 22 Utilities 539.
## 4 23 Construction 7413
## 5 31-33 Manufacturing 12332
## 6 42 Wholesale Trade 5707.
## 7 44-45 Retail Trade 15361.
## 8 48-49 Transportation and Warehousing 5737.
## 9 51 Information 2726
## 10 52 Finance and Insurance 6534.
## 11 53 Real Estate and Rental and Leasing 2272.
## 12 54 Professional, Scientific, and Technical Services 9791
## 13 55 Management of Companies and Enterprises 2330.
## 14 56 Administrative and Support and Waste Management and… 8727.
## 15 61 Educational Services 3536.
## 16 62 Health Care and Social Assistance 20001.
## 17 71 Arts, Entertainment, and Recreation 2038.
## 18 72 Accommodation and Food Services 12763.
## 19 81 Other Services (except Public Administration) 5665
Ta da! The end result is a nice tidy dataframe with the number of employees in June 2021 for each sector.
It’s always a good idea to spot check the results, especially any values that look suspiciously high or low.
The value for “Agriculture, Forestry, Fishing and Hunting” seems suspiciously low, compared with the other values.
Let’s figure out why.
Here’s the table on the mining sector page: https://www.bls.gov/iag/tgs/iag21.htm
Task Visit the web page linked above. Briefly describe what you see there. Answer The webpage gives information about the Mining, Quarrying, and Oil and Gas Extraction sector. There is a table on workforce statistics, showing employment/unemployment.
Task Run the following code to display an image showing what you should see on the web page.
And here’s the table for the agriculture sector.
Task Visit the web page linked above. Briefly describe what you see there. Answer The webpage gives information about the Agriculture, Forestry, Fishing and Hunting sector. There is a table on workforce statistics, showing the unemployment rate.
Task Run the following code to display an image showing what you should see on the web page.
Unlike mining – and every other sector page (I checked each page) – the agriculture page is structured differently.
In the second row of this table, it has the unemployment rate. Nowhere on the page can we find information on the number of employees. We would need to do additional research to track down a valid number if we plan on using this table, but for now we’re going to replace it with an NA using na_if.
Task Run the following code to remove agriculture from table. Briefly describe the output that appears below the codeblock. Answer It shows the sector table with an NA under the number of employees.
# remove the suspicious value for agriculture.
employment_by_sector_all <- employment_by_sector_all %>%
mutate(jun_2021_employees = na_if(jun_2021_employees,7.5))
# display it
employment_by_sector_all
## # A tibble: 19 × 3
## sector description jun_2021_employe…
## <chr> <chr> <dbl>
## 1 11 Agriculture, Forestry, Fishing and Hunting NA
## 2 21 Mining, Quarrying, and Oil and Gas Extraction 587.
## 3 22 Utilities 539.
## 4 23 Construction 7413
## 5 31-33 Manufacturing 12332
## 6 42 Wholesale Trade 5707.
## 7 44-45 Retail Trade 15361.
## 8 48-49 Transportation and Warehousing 5737.
## 9 51 Information 2726
## 10 52 Finance and Insurance 6534.
## 11 53 Real Estate and Rental and Leasing 2272.
## 12 54 Professional, Scientific, and Technical Services 9791
## 13 55 Management of Companies and Enterprises 2330.
## 14 56 Administrative and Support and Waste Management and… 8727.
## 15 61 Educational Services 3536.
## 16 62 Health Care and Social Assistance 20001.
## 17 71 Arts, Entertainment, and Recreation 2038.
## 18 72 Accommodation and Food Services 12763.
## 19 81 Other Services (except Public Administration) 5665
And we’re done.
A note about advanced scraping – every site is different. Every time you want to scrape a site, you’ll be puzzling over different problems. But the steps remain the same: find a pattern, exploit it, clean the data on the fly and put it into a place to store it.